Abstract. We apply multiple testing procedures to the validation
of estimated default probabilities in credit rating systems. The
goal is to identify rating classes for which the probability of default
is estimated inaccurately, while still maintaining a predefined level
of committing type I errors as measured by the familywise error
rate (FWER) and the false discovery rate (FDR). For FWER, we
also consider procedures that take possible discreteness of the data
resp. test statistics into account. The performance of these methods
is illustrated in a simulation setting and for empirical default
data.



1. Introduction 

Banks use rating systems to classify borrowers according to their 
credit risk. These systems form the basis for pricing credits and for 
determining risk premiums and capital requirements for the bank, cf. 
[27]. One of the key components in this set-up is the probability of 
default (PD), i.e. the likelihood with which borrowers will default in 
a prespecified time period. Banks that use an internal ratings-based 
(IRB) approach as described in the Basel II framework are required
to report a PD estimate for each borrower. In practice, borrowers 
are grouped together into rating grades for which a pooled or average 
PD is calculated. Correct calibration of a rating system means that 
the respective PD estimates or forecasts are accurate. Inaccurate PD 
forecasts can lead to substantial losses, see ['2i ] and [ ]. Since correct 
calibration is crucial to the appropriate functioning of a rating system, 
banks are also required by regulatory authorities to validate their PD 
estimates by comparing the forecasted PD to realized default rates ([1, 
§463 and §464]). This process is also known as backtesting. Validation 
of PD estimates can be carried out simultaneously for all rating grades
in a joint statistical test or separately for each rating grade, depending
on whether an overall assessment or a detailed examination is
intended, cf. [27]. In this paper we are primarily concerned with the
latter aim.

The goals of backtesting can vary and will depend on the viewpoint 
of the involved parties. If backtesting is performed with the aim of 
demonstrating calibration to a regulatory authority, the bank could 
be interested in controlling the probability that a correctly calibrated 
rating system is dismissed or has to be recalibrated. If the goal is to 
provide more detailed internal information for a bank's risk manage- 
ment it may be desirable to also consider more liberal or exploratory 
methods that generate early warnings that can help to identify and 
subsequently investigate potential shortcomings of the rating system. 

From a statistical viewpoint, PD validation could be described as 
the simultaneous assessment of the predictive quality of multiple prob- 
ability forecasts. In practice, the main statistical methods used for PD 
validation are (exact or asymptotic) binomial and chi-square tests as 
well as the so called 'normal test' and various 'traffic light approaches' 
(cf. [27] and [8] for more details). The binomial and normal tests as 
well as the traffic light approaches are applied separately to each rating 
grade whereas the chi-square or Hosmer-Lemeshow test is a global test 
that can assess several rating categories simultaneously. The normal test
and the more exploratory traffic light approaches are multi-period tests 
which are based on normal approximations and can take dependencies
into account. Blöchlinger and Leippold ([ ]) develop a new global
goodness-of-fit test for probability forecasts and apply this to empirical
default data. Their test consists of two components and they show that
the corresponding test statistic is asymptotically $\chi^2$-distributed.

There are several statistical issues associated with PD validation, of 
which we only mention the two major ones (for more details see e.g. 
[27] and [8]). Firstly, default data is usually sparse and sample sizes
are often small. Combined with PD estimates that usually are very 
small numbers, this means that the respective hypothesis tests possess 
low power. Moreover, default events are generally not independent and 
therefore PD estimates and validation methods should take this into 
account. Since the main purpose of this paper is to introduce some 
new concepts to the problem of per-class PD validation we assume for 
the sake of simplicity and to limit the scope of this paper that default 
events are independent. However, we give some indication of how the 
methods described here can be adapted to correlated default data in 
section 6. 

Our aim is to describe some statistical tools that can be used to
provide detailed assessments of rating systems; specifically, we are
concerned with identifying conspicuous resp. miscalibrated rating classes.
Suppose a rating system consists of 20 rating grades and for each grade 
a test is performed at the 5% significance level. If all null hypotheses 
are true, then the expected number of false rejections, i.e. erroneously
detected miscalibrations, will be one. If in addition, the respective test
statistics are independent, then the probability of observing at least
one false miscalibration finding will be $1 - (1 - 0.05)^{20} \approx 0.64$, i.e. the
probability of committing at least one type I error is far from being
controlled at the 5% level. This and related phenomena are known as
the multiple testing problem or the problem of multiple comparisons,
see [21]. For PD validation, this means that even if all rating classes 
are perfectly calibrated, the chances of observing a significant finding 
resp. erroneously concluding that some classes were miscalibrated, is 
quite high. The problem therefore is to decide which of the significant 
classes can still be considered miscalibrated and which were identified 
merely due to performing a multiple number of tests. To the best of 
our knowledge, this problem has received little attention within the 
context of PD validation. Rauhmeier ([21]) takes the problem partly 
into account in the construction of an original test, which is based on 
the sum of the number of gradewise rejections. However, as this test is 
designed as an overall goodness-of-fit test, it cannot identify single
conspicuous PD estimates. Huschens ([I'*]) considers several simultaneous
tests and mentions that the Bonferroni procedure (see section 3.1) is
inappropriate due to its conservativeness. Since his considerations take place
in an asymptotic setting, however, he also emphasizes that these tests
may produce unacceptable results for rating classes with sparse data.
In fact, he poses the question how a simultaneous testing procedure 
could be developed that takes into account the sparseness of data in 
some rating classes and data richness in others. We attempt to give an 
answer to this question in section 3.4. 
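
The arithmetic behind the 20-grade illustration at the start of this paragraph can be checked with a few lines of Python. This is only a sketch under the independence assumption stated in the text; the function names are illustrative and do not come from the paper.

```python
def expected_false_rejections(num_tests, level):
    """Expected number of false rejections when all null hypotheses are true."""
    return num_tests * level


def prob_at_least_one_false_rejection(num_tests, level):
    """Probability of one or more false rejections for independent tests."""
    return 1.0 - (1.0 - level) ** num_tests


# 20 rating grades, each tested at the 5% level
expected = expected_false_rejections(20, 0.05)
fwer = prob_at_least_one_false_rejection(20, 0.05)
print(expected, round(fwer, 2))
```

The second quantity grows quickly with the number of grades, which is why the unadjusted grade-wise tests cannot be read at face value.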

Multiple testing procedures (in the sequel abbreviated as MTPs) 
provide a well-established methodological framework for dealing with 
multiplicity issues, with several monographs (cf. e.g. [13], [30] and 
[l' ]) and a large number of research papers available. While MTPs 
have been used in many areas of application such as clinical trials, mi- 
croarray experiments, astronomy and magnetic resonance imaging, the 
validation of PD (and more generally probability) forecasts constitutes 
to the best of our knowledge a novel field of application. 

The plan for this paper is as follows. In section 2 some further back- 
ground is given on PD validation and the associated testing problems. 
Section 3 reviews some multiple testing procedures with a focus on
discrete distributions. These procedures are applied in a simulation
study in section 4 and to empirical data in section 5. Following this,
an extension to dependent defaults in a single-factor model is sketched
in section 6. The paper concludes with a discussion in section 7.

2. Notation and assumptions 

In this section we introduce some terminology and assumptions that 
will be used in the sequel. 

2.1. The backtesting approach. We consider credit portfolios consisting
of a total number of borrowers who have been classified into
$K$ rating classes. Each rating class is associated with a true but unknown
(average) PD $p_1, \ldots, p_K$ as well as estimated PDs $pd_1, \ldots, pd_K$.
The basic idea of the backtesting approach is to split the total sample
into a training or estimation sample and a validation sample.

(1) In the first step, the probability forecast resp. classifier is con- 
structed based on the training sample. In practice, the training 
sample usually consists of data collected up to some time point 
t and estimators for the default probability resp. rating classes 
are usually assigned to individual borrowers based on a vector 
X of features (covariates) associated with the borrower. Pop- 
ular models for the dependency of the default probability on 
X are logistic and probit regression but also nonlinear methods 
like neural networks and decision trees are used in this context, 
cf. e.g. [il]. Note that in this paper we are not concerned 
with the construction of PD forecasts resp. classifiers but only 
with assessing the accuracy of a given forecast. Therefore we 
assume in the sequel that this probability forecast has already 
been constructed. 

(2) The validation sample usually consists of data observed during
some future time period, e.g. between $t$ and $t + 1$. We denote by
$n_j$ the number of borrowers that were assigned to probability
forecast $pd_j$ resp. rating class $j$ (say at time $t$) and let $o_j$ denote
the number of defaults observed in the rating class between $t$
and $t + 1$. Then the true probabilities of default can be estimated
e.g. by the quantities $o_1/n_1, \ldots, o_K/n_K$ and the quality of the
probability forecast resp. classifier can be assessed by statistical
tests as described in the introduction.

2.2. Testing calibration hypotheses. For given $n_1, \ldots, n_K$ and
$o_1, \ldots, o_K$ it is to be decided whether the probability forecasts
$pd_1, \ldots, pd_K$ are correct. For $j = 1, \ldots, K$ and $l = 1, \ldots, n_j$ let
$X_{lj} \in \{0, 1\}$ denote the rv that indicates whether borrower $l$ in rating
grade $j$ defaults ($X_{lj} = 1$) or not ($X_{lj} = 0$). We assume throughout this
paper that

(A) $X_{lj} \sim \mathrm{Bin}(1, p_j)$ and all $X_{lj}$ are independent.

As mentioned in the introduction, independence between all default 
events is an unrealistic assumption. However, our primary goal is to 
describe some general MTP approaches to the calibration of PD fore- 
casts. For the clarity of exposition and to concentrate on the main 
concepts we therefore defer dealing with dependency issues to future 
work (see also section 6). In case of perfect probability forecasts we
would have $O_j \sim \mathrm{Bin}(n_j, p_j)$, where

$O_j = X_{1j} + \cdots + X_{n_j j}$

and we define null hypotheses accordingly in this probability model as

(1) $H_0^j : p_j = pd_j$ vs. $H_1^j : p_j \neq pd_j$,

and we say that rating class $j$ is calibrated correctly if $H_0^j$ holds true. In
the same spirit we call the probability forecast calibrated in the overall
sense if the global hypothesis

(P) $H_0 := H_0^1 \cap \cdots \cap H_0^K$

holds true, i.e. if it is calibrated for all rating classes. Note that
we consider throughout this paper two-sided hypotheses only. This 
can be interpreted as the viewpoint of the bank's risk manager who 
is interested in detecting both overly optimistic and overly pessimistic 
PD estimates, while regulatory authorities may focus only on one-sided 
tests that detect underestimation of PDs. However, the MTP approach 
introduced in section 3 can straightforwardly be adapted to the one- 
sided case. 
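
As an illustration of a grade-wise two-sided test of $p_j = pd_j$, the following Python sketch computes an exact binomial p-value. It follows the convention described for the R function binom.test (summing the probabilities of all outcomes that are no more likely than the observed one); it is an illustrative re-implementation, not the author's code, and the relative tolerance `rel_tol` is an assumption that treats numerically tied probabilities as equal.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k defaults among n independent borrowers with PD p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_test_two_sided(defaults, n, pd, rel_tol=1e-7):
    """Exact two-sided p-value for H0: p = pd given `defaults` out of n."""
    pmf = [binom_pmf(k, n, pd) for k in range(n + 1)]
    threshold = pmf[defaults] * (1.0 + rel_tol)
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    return min(1.0, sum(q for q in pmf if q <= threshold))


# One default among 7 borrowers with a forecast PD of 0.2%
pv = binom_test_two_sided(1, 7, 0.002)
```

For very small forecast PDs the most likely outcome is zero defaults, so a single default already yields a small p-value, while zero defaults always gives a p-value of 1.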

For the simulation experiments in section 4 it will be helpful to view 
the problem of forecasting PDs as a classification problem. Suppose 
that it is known that the true possible default probabilities are given
by $p_1, \ldots, p_K$. In this case, the problem of PD forecasting becomes one
of PD classification, i.e. deciding for each borrower which of the $p_j$ is
true. We denote by $\tilde{n}_1, \ldots, \tilde{n}_K$ the true number of borrowers in classes
$1, \ldots, K$. Ideal forecasting resp. perfect classification would mean that
$n_j = \tilde{n}_j$ for $j = 1, \ldots, K$. In reality we will usually encounter a certain
amount of misclassification. To describe this we introduce

$n_{ij} :=$ number of borrowers (truly) from class $i$ that are classified as belonging to class $j$.
Therefore

$n_j = n_{1j} + \cdots + n_{Kj}$

and it follows from assumption (A) that the distribution of $O_j$ is given
by a convolution of binomial distributions:

$O_j \sim \mathop{\ast}_{i=1}^{K} \mathrm{Bin}(n_{ij}, p_i)$.

The matrix $N = (n_{ij})_{1 \le i,j \le K}$ is also known as the misclassification or
confusion matrix (see e.g. [20]) and the expectations and variances of
the absolute default frequencies can be expressed conveniently through
the elements of $N$ and the given PDs.
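
To make the last remark concrete: under assumption (A) the default count in an assigned class is a sum of independent binomials, so its mean and variance are sums of the binomial means and variances over the true classes. The Python sketch below computes both; the 2x2 confusion matrix and the PDs are hypothetical values chosen only for illustration.

```python
def default_count_moments(confusion, true_pds):
    """Mean and variance of the default count in each assigned class.

    confusion[i][j] is the number of borrowers truly from class i that were
    assigned to class j; true_pds[i] is the true PD of class i.
    """
    num_classes = len(true_pds)
    moments = []
    for j in range(num_classes):
        mean = sum(confusion[i][j] * true_pds[i] for i in range(num_classes))
        variance = sum(
            confusion[i][j] * true_pds[i] * (1.0 - true_pds[i])
            for i in range(num_classes)
        )
        moments.append((mean, variance))
    return moments


# Hypothetical example: two classes with some misclassification
confusion = [[90, 10],
             [5, 95]]
moments = default_count_moments(confusion, [0.01, 0.05])
```

With a diagonal confusion matrix the result reduces to the usual binomial moments of a correctly classified grade.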

For testing the grade-wise calibration hypotheses $H_0^1, \ldots, H_0^K$ we use
an exact binomial test, see also comment (ii) in section 3.2. For testing
the global hypothesis $H_0$ in the setting introduced above, a $\chi^2$
goodness-of-fit test based on the statistic

$T_{HL} := \sum_{j=1}^{K} \frac{(O_j - E_j)^2}{V_j^2}$

is commonly used, where $E_j = n_j \cdot p_j$ resp. $V_j^2 = n_j \cdot p_j (1 - p_j)$ denote
the expectation resp. variance of $O_j$ under $H_0$. Hosmer and Lemeshow
used a related statistic for assessing the fit of logistic regression models.
In [16] they discuss two methods of grouping risks based on ranked
probability estimates:

(1) In the 'deciles of risk' approach, groups of equal numbers of 
risks are formed. 

(2) In the 'fixed cutpoint' approach, risks are mapped into classes 
determined by predefined probability cutpoints on the (0, 1) 
interval. This is essentially the approach usually taken in PD 
validation. 

Under appropriate asymptotic conditions (e.g. all $E_j$ should be sufficiently
large), $T_{HL}$ is approximately $\chi^2$-distributed under $H_0$. It has
been demonstrated in [17] that the deciles of risk approach yields a
better approximation to the corresponding $\chi^2$ distribution than the
fixed cutpoint approach, especially when the estimated probabilities
are smaller than e.g. 0.2. For more details on the advantages and disadvantages
of Hosmer-Lemeshow type tests, see [15]. When the sample
size is too small to justify the use of asymptotic methods (as is often
the case for credit portfolios), the distribution of $T_{HL}$ under $H_0$ can be
determined by simulation, cf. [24]. The corresponding test can be seen
as an exact version of the HL-test which corrects for the finite sample
size and is denoted by (HL) in the sequel.
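
A simulation-based version of the global test can be sketched as follows. The statistic sums, over all grades, the squared deviation of observed from expected defaults divided by the binomial variance under the null hypothesis; its null distribution is approximated by drawing independent defaults with the forecast PDs. The number of simulations, the seed and the "+1" Monte Carlo p-value convention are assumptions of this sketch and are not taken from the paper.

```python
import random


def hl_statistic(defaults, sizes, pds):
    """Sum of squared standardised deviations of defaults from their null expectation."""
    total = 0.0
    for o, n, pd in zip(defaults, sizes, pds):
        expected = n * pd
        variance = n * pd * (1.0 - pd)
        total += (o - expected) ** 2 / variance
    return total


def simulate_defaults(sizes, pds, rng):
    """Draw independent default counts, one per grade, under the null hypothesis."""
    return [sum(1 for _ in range(n) if rng.random() < pd) for n, pd in zip(sizes, pds)]


def hl_monte_carlo_pvalue(defaults, sizes, pds, num_sim=2000, seed=1):
    """Monte Carlo p-value of the statistic (requires 0 < pd < 1 for all grades)."""
    rng = random.Random(seed)
    observed = hl_statistic(defaults, sizes, pds)
    exceed = sum(
        1
        for _ in range(num_sim)
        if hl_statistic(simulate_defaults(sizes, pds, rng), sizes, pds) >= observed
    )
    return (exceed + 1) / (num_sim + 1)


# Hypothetical two-grade portfolio
t_obs = hl_statistic([2, 5], [100, 100], [0.01, 0.05])
p_mc = hl_monte_carlo_pvalue([2, 5], [100, 100], [0.01, 0.05])
```

Because the reference distribution is simulated rather than asymptotic, the sketch does not require the expected default counts to be large.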

3. A review of some multiple testing procedures

In order to limit the scope of this paper, we confine our review to a
selection of classical multiple testing procedures as well as the MinP 
approach. For a more complete treatment we refer to the literature on 
multiple testing cited in the introduction. 

We are interested in simultaneously testing a family $H_0^1, \ldots, H_0^K$ of
null hypotheses while controlling the probability of one or more false
rejections at a multiple level $\alpha$. This probability is called the familywise
error rate (FWER). To be more precise, we require strong control
of FWER, i.e. that $\mathrm{FWER} \le \alpha$ holds for all possible constellations
of true and false hypotheses. The principal application we have in
mind in the context of PD validation is hypothesis tests for binomial
proportions (cf. section 2.2). From the validation viewpoint, it seems
highly desirable that apart from controlling the FWER, the multiple 
testing method employed should possess high power in order to detect 
possible departures from calibration. 

In the sequel let $pv_1, \ldots, pv_K$ denote the p-values observed from testing
hypotheses $H_0^1, \ldots, H_0^K$ and assume that these values are ordered
$pv_1 \le \cdots \le pv_K$.

3.1. Bonferroni-type methods. The Bonferroni method (in the sequel
abbreviated as (Bonf)) is a classical method that maintains control
of the FWER. Adjusted p-values are defined by $pv'_j := \min(K \cdot pv_j, 1)$
and all hypotheses with $pv'_j \le \alpha$ are rejected.

Instead of using the (single-step) Bonferroni method one can use
the more powerful Holm step-down (from the most significant to the
least significant result) procedure (Hol), which works the following way:
Define adjusted p-values by

$pv'_1 := K \cdot pv_1$,
$pv'_2 := \max(pv'_1, (K - 1) \cdot pv_2)$,
$pv'_3 := \max(pv'_2, (K - 2) \cdot pv_3)$,
$\vdots$
$pv'_K := \max(pv'_{K-1}, pv_K)$

and again set the adjusted p-values exceeding 1 to 1. All hypotheses
with $pv'_j \le \alpha$ can then be rejected. Another variant of Bonferroni-type
adjustment which is more powerful than Holm's procedure is Hommel's
(Hom) approach, which is valid under independence or positive
dependence assumptions (for details refer to [' ']).
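
The Bonferroni and Holm adjustments described above can be sketched in Python as follows. The paper's own calculations use the R function p.adjust; the functions below are an illustrative re-implementation that accepts p-values in any order and returns the adjusted values in the same order.

```python
def bonferroni_adjust(pvalues):
    """Single-step Bonferroni adjusted p-values, capped at 1."""
    num_tests = len(pvalues)
    return [min(num_tests * pv, 1.0) for pv in pvalues]


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, capped at 1."""
    num_tests = len(pvalues)
    order = sorted(range(num_tests), key=lambda i: pvalues[i])
    adjusted = [0.0] * num_tests
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The multiplier shrinks from K to 1 while stepping down the ordered list.
        running_max = max(running_max, (num_tests - rank) * pvalues[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted


example = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni_adjust(example)
holm = holm_adjust(example)
```

Each Holm adjusted value is at most the corresponding Bonferroni value, which is the sense in which the step-down procedure is more powerful.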

All the procedures described above provide strong control of the
FWER under certain circumstances. For the Bonferroni and Holm
procedure this holds true e.g. when the p-values, considered as random
variables $PV_1, \ldots, PV_K$, are stochastically
larger under the respective null hypothesis than some uniformly
distributed random variable, i.e. for $i = 1, \ldots, K$ it holds
$P(PV_i \le u \mid H_0^i) \le u$ for all $u \in (0, 1)$, cf. [21]. However, as noted e.g. in [ ],
these procedures can be very conservative, especially if the p-values are
distributed discretely. Therefore it makes sense to investigate multiple
testing procedures developed specifically for discrete distributions.

3.2. The MinP approach for discrete distributions. Gutman and
Hochberg review and compare the performance of several FWER-controlling
MTPs for discrete distributions (cf. [10] and the references
cited therein). They investigate Tarone's method, two variants of a
method by Roth, the method of Hommel and Krummenauer, the MinP
method of Westfall and collaborators (see [28] and the references cited
therein) and an original method called $TWW_k$. All methods except the
MinP method and the method of Hommel and Krummenauer lack $\alpha$-consistency.
This means that possibly a hypothesis cannot be rejected
at some level $\alpha_1$ but can be rejected at some lower level $\alpha_2$. In addition
it is shown in ['' ] that the MinP method is universally more powerful
than the method of Hommel and Krummenauer. Since $\alpha$-consistency
would seem to be a desirable property in the validation context considered
here, we concentrate in the sequel on the more powerful method
of the two, namely the MinP approach.

3.2.1. The single-step version. Suppose the distribution of $\min(PV_1, \ldots, PV_K)$,
when all null hypotheses are true, is available. For the single-step variant
the idea of the MinP approach is to define adjusted p-values by

$pv'_j := P(\min(PV_1, \ldots, PV_K) \le pv_j)$

where $pv_1, \ldots, pv_K$ are the p-values observed for the data, i.e. the
$j$th adjusted p-value is the probability that the minimum p-value is
no larger than the $j$th observed p-value. In [29] it is pointed out that
this quantity measures the 'degree of surprise that the analyst should 
experience after isolating the smallest p-value from a long list of p- 
values calculated from a given data set.' For the relationship of the 
MinP procedure to some other MTPs and its use in the analysis of 
toxicology data, see [29] as well. 

3.2.2. The step-down version. Corresponding to the single-step method
described above, a step-down variant can be constructed. Following
[28] define $H_I := \bigcap_{i \in I} H_0^i$ for $I \subseteq \{1, \ldots, K\}$. Suppose again the observed
p-values are $pv_1 \le \cdots \le pv_K$, corresponding to null hypotheses
$H_0^1, \ldots, H_0^K$, then define adjusted p-values

(2) $pv'_j := \max_{i \le j} pv_{\{i, \ldots, K\}}$

where $pv_I$ is given by

$pv_I = P(\min_{i \in I} PV_i \le \min_{i \in I} pv_i \mid H_I)$.

The decision rule 'reject $H_0^j$ if $pv'_j \le \alpha$' yields a procedure which controls
the FWER at level $\alpha$ if the so-called 'subset pivotality condition'
is fulfilled. Subset pivotality means that the distribution of any subvector
of p-values under the respective null hypotheses is unaffected by
the truth or falsehood of the remaining hypotheses, i.e.

(SPC) For all $I \subseteq \{1, \ldots, K\}$ the distributions of $(PV_i)_{i \in I} \mid H_I$ and
$(PV_i)_{i \in I} \mid H_{\{1, \ldots, K\}}$ are identical.

For the MinP approach (SPC) implies that the distributions of $\min_{i \in I} PV_i \mid H_I$
and $\min_{i \in I} PV_i \mid H_{\{1, \ldots, K\}}$ are identical, cf. [28] for the relationship of this
method with the closure principle in multiple testing. Clearly, (SPC)
holds if the distribution of each $PV_j$ depends only on the validity of $H_0^j$.

Proposition 1. Let $H_0^1, \ldots, H_0^K$ be (general) hypotheses with associated
p-value rv's $PV_1, \ldots, PV_K$. If the distribution of each $PV_j$ only
depends on the validity of $H_0^j$, i.e. for all $j$ and $I \subseteq \{1, \ldots, K\}$ with
$j \in I$ it holds

(3) $PV_j \mid H_I \sim PV_j \mid H_0^j$,

then (SPC) holds true.

Proof. It holds that

$(PV_i)_{i \in I} \mid H_I \sim (PV_i \mid H_I)_{i \in I}$
$\sim (PV_i \mid H_0^i)_{i \in I}$ by (3)
$\sim (PV_i \mid H_{\{1, \ldots, K\}})_{i \in I}$ by (3)
$\sim (PV_i)_{i \in I} \mid H_{\{1, \ldots, K\}}$.

□

Comments:

(i) We can apply proposition 1 to the basic set-up introduced in
section 2.2. Let $PV_j := PV_j(O_j)$, where $O_j$ denotes the number
of observed defaults in rating grade $j$. Clearly, for the hypotheses
given by (1) condition (3) then holds true. In the sequel we
will use the exact binomial test to calculate these p-values, but
we note that this is not essential to our approach and that in
principle any test that controls the type I error on the test-wise
level could be used.

(ii) As described in [29], there is some controversy concerning the 
definition of two-sided p-values for discrete tests. Generally, 
different types of two-sided p-values will affect not only the 
observed p-values, but also their distribution and therefore also 
the MinP adjusted p-values. For the calculations in this paper
two-sided p-values implemented in the R-function binom.test 
are used. These values are based on the sum of probabilities of 
events with smaller likelihood than the observed data, see also 
[12]. 

(iii) Note that proposition 1 is also applicable in the case of depen- 
dent p-values as long as condition (3) is satisfied. 

(iv) In order to calculate the adjusted p-values in (2), the distribution
functions

$F_{\{j, \ldots, K\}}(x) := P(\min_{i \in \{j, \ldots, K\}} PV_i \le x \mid H_{\{j, \ldots, K\}})$

have to be determined. In general, simulation techniques will have
to be used to accomplish this, but for the case where all observations
are independent there is a simpler way, which is described
in the next section.

3.2.3. Determining the MinP distribution function for independent p-values.
If $PV_1, \ldots, PV_K$ are independent we have

$F^{Ind}_{\{1, \ldots, K\}}(x) = 1 - \prod_{i=1}^{K} (1 - F_i(x))$

where $F_i(x) = P(PV_i \le x)$ is the distribution function of the $i$th
p-value under $H_{\{1, \ldots, K\}}$. Let $A_i := \{pv_{it} \mid t = 1, \ldots, m_i\}$ denote the
ordered possible values of $PV_i$ under $H_{\{1, \ldots, K\}}$, i.e. $pv_{i1} < pv_{i2} < \cdots < pv_{i m_i}$.
When the distribution of $PV_i$ is discrete, then $F_i$ is a (right-continuous)
step function with jump discontinuities at abscissa values
$pv_{i1} < \cdots < pv_{i m_i}$. If no assumption on the dependency structure of
$(PV_1, \ldots, PV_K)$ is made, the Bonferroni inequality yields the following
conservative bound

$F_{\{1, \ldots, K\}}(x) \le F^{Bonf}_{\{1, \ldots, K\}}(x) := \min(\sum_{i=1}^{K} F_i(x), 1)$.

This means that both $F^{Ind}_{\{1, \ldots, K\}}$ and $F^{Bonf}_{\{1, \ldots, K\}}$ are step functions with
discontinuities at the values $A = \bigcup_{i=1}^{K} A_i$. Assume that the set $A$ of all
possible p-values in the experiment consists of ordered values $x_1 < \cdots < x_M$.
Now define for $i = 1, \ldots, K$ and $l = 1, \ldots, M$

$y_{il} := F_i(x_l)$.

Then it holds

$y_l^{Ind} := 1 - \prod_{i=1}^{K} (1 - y_{il})$,
$y_l^{Bonf} := \min(\sum_{i=1}^{K} y_{il}, 1)$,

and the values $y_l^{Ind}$ resp. $y_l^{Bonf}$ are the ordinate values of $F^{Ind}_{\{1, \ldots, K\}}$ resp.
$F^{Bonf}_{\{1, \ldots, K\}}$.

The approach described above for $\{1, \ldots, K\}$ carries over directly to
index sets $\{j, \ldots, K\}$ and so the p-values needed for the determination
of the adjusted p-values in (2) can be obtained by

$pv_{\{j, \ldots, K\}} = F_{\{j, \ldots, K\}}(\min(pv_j, \ldots, pv_K))$

where $F_{\{j, \ldots, K\}}$ is $F^{Ind}_{\{j, \ldots, K\}}$ if the p-values are independent or could be
chosen conservatively as $F^{Bonf}_{\{j, \ldots, K\}}$ in the general dependency case. In
the examples considered in section 4, the differences between the discrete
Bonferroni MinP method (d-Bonf) and the discrete independence
MinP method (d-Ind) are mostly minimal. Therefore we concentrate
in the sequel on (d-Bonf) and the corresponding step-down method
(sd-d-Bonf). For dealing with specific forms of dependencies, power
can be gained by using the simulation approaches mentioned above.
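
A possible way to turn the construction of this section into step-down adjusted p-values is sketched below in Python. It is not the author's implementation: it assumes exact two-sided binomial p-values in the style of binom.test, uses the conservative Bonferroni bound on the MinP distribution, and is run on a hypothetical three-grade portfolio. All function and variable names are illustrative.

```python
from math import comb


def pvalue_table(n, pd, rel_tol=1e-7):
    """Two-sided exact binomial p-value and null probability for each count 0..n."""
    pmf = [comb(n, k) * pd**k * (1.0 - pd) ** (n - k) for k in range(n + 1)]
    pvalues = [
        min(1.0, sum(q for q in pmf if q <= pmf[k] * (1.0 + rel_tol)))
        for k in range(n + 1)
    ]
    return pvalues, pmf


def pvalue_cdf(table, x):
    """Distribution function of the p-value under its null hypothesis, at x."""
    pvalues, pmf = table
    return sum(prob for pv, prob in zip(pvalues, pmf) if pv <= x)


def sd_d_bonf(defaults, sizes, pds):
    """Observed and step-down discrete Bonferroni MinP adjusted p-values."""
    tables = [pvalue_table(n, pd) for n, pd in zip(sizes, pds)]
    observed = [tables[j][0][o] for j, o in enumerate(defaults)]
    order = sorted(range(len(sizes)), key=lambda j: observed[j])
    adjusted = [0.0] * len(sizes)
    running_max = 0.0
    for rank, j in enumerate(order):
        remaining = order[rank:]  # hypotheses not yet stepped over
        # Bonferroni bound on P(min p-value over remaining <= smallest remaining p-value)
        bound = min(1.0, sum(pvalue_cdf(tables[i], observed[j]) for i in remaining))
        running_max = max(running_max, bound)  # running maximum keeps monotonicity
        adjusted[j] = running_max
    return observed, adjusted


# Hypothetical portfolio: one default in the first grade, none elsewhere
observed, adjusted = sd_d_bonf(
    defaults=(1, 0, 0), sizes=(7, 6, 2), pds=(0.002, 0.0035, 0.0105)
)
```

In this hypothetical case the adjusted p-value of the first grade stays close to its unadjusted value, whereas a continuous Bonferroni correction would multiply it by three; the gain comes from the other grades being unable to produce p-values that small except through very rare outcomes.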

Example. To compare the MinP approach with the continuous Bonferroni
resp. independence corrections we consider $K = 11$ hypotheses
given by $H_0^j : O_j \sim \mathrm{Bin}(n_j, p_j)$ with

$(n_1, \ldots, n_{11}) = (31, 17, 7, 8, 7, 6, 7, 2, 5, 8, 2)$,
$(p_1, \ldots, p_{11}) = (0.00015, 0.0003, 0.00060, 0.0011, 0.002, 0.0035, 0.006, 0.0105, 0.0185, 0.0325, 0.057)$.

Figure 1 shows the uniform and discrete versions of the distribution
functions $F^{Ind}_{\{1, \ldots, 11\}}$ and $F^{Bonf}_{\{1, \ldots, 11\}}$ where the discrete versions were obtained
by the method described above. (For the uniform (continuous)
case we have $F_i(x) = x$ for $x \in [0, 1]$.)

[Fig. 1 about here] 

This figure shows that the difference between the continuous and the
discrete approach is considerable, whereas in either case there seems to
be no relevant difference between the independence and the slightly more
conservative Bonferroni correction. For the usual significance level $\alpha = 0.05$
the MinP-based critical value (under independence or Bonferroni
dependence) is $c_{0.05} \approx 0.0139$, i.e. $H_{\{1, \ldots, 11\}}$ is rejected if the minimum
p-value observed in the eleven hypothesis tests is less than or equal to
this value. For the continuous approaches we have $c^{Ind}_{0.05} \approx 0.0047$ and
$c^{Bonf}_{0.05} \approx 0.0045$. From the viewpoint of the continuous approaches, the
MinP-based critical value therefore corresponds to an effective number
of three tests instead of eleven.
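
The critical values quoted in this example can be recomputed with the following Python sketch. It assumes two-sided exact binomial p-values in the style of binom.test and takes the critical value to be the largest attainable p-value at which the MinP distribution function does not exceed the level; the helper names are illustrative and not from the paper.

```python
from math import comb, prod

SIZES = (31, 17, 7, 8, 7, 6, 7, 2, 5, 8, 2)
PDS = (0.00015, 0.0003, 0.0006, 0.0011, 0.002, 0.0035,
       0.006, 0.0105, 0.0185, 0.0325, 0.057)


def pvalue_table(n, pd, rel_tol=1e-7):
    """Two-sided exact binomial p-value and null probability for each count 0..n."""
    pmf = [comb(n, k) * pd**k * (1.0 - pd) ** (n - k) for k in range(n + 1)]
    pvalues = [
        min(1.0, sum(q for q in pmf if q <= pmf[k] * (1.0 + rel_tol)))
        for k in range(n + 1)
    ]
    return pvalues, pmf


def pvalue_cdf(table, x):
    """Distribution function of one p-value under its null hypothesis, at x."""
    pvalues, pmf = table
    return sum(prob for pv, prob in zip(pvalues, pmf) if pv <= x)


def combine_bonf(values):
    """Bonferroni bound on the distribution function of the minimum p-value."""
    return min(sum(values), 1.0)


def combine_ind(values):
    """Distribution function of the minimum p-value under independence."""
    return 1.0 - prod(1.0 - v for v in values)


def minp_critical_value(tables, alpha, combine):
    """Largest attainable p-value x whose combined distribution value is <= alpha."""
    support = sorted({pv for pvalues, _ in tables for pv in pvalues})
    feasible = [
        x for x in support
        if combine([pvalue_cdf(t, x) for t in tables]) <= alpha
    ]
    return max(feasible) if feasible else None


tables = [pvalue_table(n, pd) for n, pd in zip(SIZES, PDS)]
c_bonf = minp_critical_value(tables, 0.05, combine_bonf)
c_ind = minp_critical_value(tables, 0.05, combine_ind)
c_cont_bonf = 0.05 / len(SIZES)                          # continuous Bonferroni
c_cont_ind = 1.0 - (1.0 - 0.05) ** (1.0 / len(SIZES))    # continuous independence
```

Both discrete variants select the same attainable p-value here, namely the p-value of a single default in the fifth class, which is roughly three times the continuous thresholds.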

As stated in [:: /], the benefit of the MinP approach generally 'de- 
pends on the specific characteristics of the discrete distributions. Larger 
gains are possible when K is large, and where many variables are 
sparse'. 

3.3. False discovery rate. Instead of controlling the FWER, the al- 
gorithm of Benjamini and Hochberg (["']) and related methods seek 
control of the 'false discovery rate' (FDR), where a false discovery occurs
whenever a null hypothesis is erroneously rejected. Let $m_0$ denote
the (unknown) number of true hypotheses, $V$ the number of true hypotheses
that are erroneously rejected by some given MTP, let $R$ be
the total number of rejected hypotheses and set $Q := V / \max(R, 1)$.
Then the FDR is defined as $\mathrm{FDR} = E(Q)$. When all null hypotheses
are true, then $\mathrm{FDR} = \mathrm{FWER}$ and when $m_0 < K$, then $\mathrm{FDR} \le \mathrm{FWER}$,
see [2]. Hence, any procedure that controls FWER also controls FDR,
but if only control of FDR is desired, these methods are potentially 
much more powerful than the methods described in the preceding sec- 
tions, especially when the number of tests is large. In the context of 
PD validation they could serve as explorative tools as mentioned in the 
introduction. 

The Benjamini-Hochberg (BH) procedure consists of rejecting $H_0^1, \ldots, H_0^k$
where $k$ is determined by

$k = \max\{i \mid pv_i \le \frac{i}{K} \cdot \alpha\}$.

If no such $i$ exists, no hypothesis is rejected. (FDR-)adjusted p-values
are defined in step-down fashion (cf. [25]):

$pv'_K := pv_K$,
$pv'_{K-1} := \min(pv'_K, \frac{K}{K-1} \cdot pv_{K-1})$,
$pv'_{K-2} := \min(pv'_{K-1}, \frac{K}{K-2} \cdot pv_{K-2})$,
$\vdots$
$pv'_1 := \min(pv'_2, K \cdot pv_1)$.

The (BH) procedure then consists of rejecting all hypotheses with $pv'_j \le \alpha$.
If the underlying rv's $PV_1, \ldots, PV_K$ are independent then it can be
shown that $\mathrm{FDR} \le m_0 \cdot \alpha / K$ holds true, with equality holding if the
test statistics are continuous (cf. [5, Theorem 5.1]), and since $m_0 \le K$,
the FDR is controlled at level $\alpha$.
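
The adjusted p-values of the (BH) procedure can be sketched in Python as follows. The paper's calculations rely on the R function p.adjust; the version below is an illustrative re-implementation that works from the largest p-value downwards and returns the results in the original order.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg (FDR) adjusted p-values."""
    num_tests = len(pvalues)
    order = sorted(range(num_tests), key=lambda i: pvalues[i])
    adjusted = [0.0] * num_tests
    running_min = 1.0
    for rank in range(num_tests - 1, -1, -1):
        idx = order[rank]
        # The p-value with (1-based) rank i is scaled by K / i.
        running_min = min(running_min, num_tests / (rank + 1) * pvalues[idx])
        adjusted[idx] = running_min
    return adjusted


def bh_reject(pvalues, alpha):
    """Rejection decisions of the (BH) procedure at FDR level alpha."""
    return [pv_adj <= alpha for pv_adj in bh_adjust(pvalues)]


example = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(example)
rejected = bh_reject(example, 0.03)
```

Rejecting all hypotheses whose adjusted value does not exceed the level gives the same decisions as determining the cut-off index $k$ directly.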

For discrete data, specific modifications of the (BH) procedure have 
been proposed by Pounds and Chen (cf. [22]) resp. Gilbert (cf. [ ]) 
in the context of analysing gene expression resp. genetics data. The 
method of Pounds and Chen is derived under the assumption that the
proportion $m_0/K$ of true hypotheses to the total number of hypotheses
is sufficiently small, i.e. it is assumed that $P((PV_1 + \cdots + PV_K)/K > 1/2) \approx 0$.
Since the number $K$ is a relatively small number in our
applications, this appears to be an inappropriate restriction. Gilbert's
modification of (BH) uses Tarone's method which lacks $\alpha$-consistency,
as noted in the beginning of this section. For these reasons we have
refrained from evaluating these methods in sections 4 and 5.

The power of the (BH) procedure can be increased by suitably estimating
$m_0$ and then incorporating this estimate by applying (BH) to
$\alpha' := \alpha \cdot K / \hat{m}_0$ instead of $\alpha$, if $\hat{m}_0 > 0$. This results in the adaptive
BH algorithm, which we denote hereafter by (a-BH). The particular
estimator $\hat{m}_0$ used here is motivated by a graphical approach originally
proposed by Schweder and Spjøtvoll (cf. [3] for more details). Further
adaptive FDR procedures which may yield more power are described
in [4] but are not investigated here.
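
The adaptive step can be illustrated with a short Python sketch. It does not reproduce the estimator of the number of true hypotheses used in the paper; instead the estimate is passed in as the argument `m0_hat`, which is an assumption of this example, and the (BH) cut-off rule is then applied at the inflated level.

```python
def adaptive_bh_reject(pvalues, alpha, m0_hat):
    """(BH) decisions at the adapted level alpha * K / m0_hat (m0_hat supplied by the caller)."""
    num_tests = len(pvalues)
    if m0_hat <= 0:
        raise ValueError("m0_hat must be positive")
    alpha_adapted = alpha * num_tests / m0_hat
    ranked = sorted(pvalues)
    cutoff_rank = 0
    for i, pv in enumerate(ranked, start=1):
        if pv <= i / num_tests * alpha_adapted:
            cutoff_rank = i  # keep the largest rank that satisfies the BH condition
    if cutoff_rank == 0:
        return [False] * num_tests
    threshold = ranked[cutoff_rank - 1]
    return [pv <= threshold for pv in pvalues]


example = [0.01, 0.04, 0.03, 0.005]
plain = adaptive_bh_reject(example, 0.03, m0_hat=4)     # same as ordinary (BH)
adaptive = adaptive_bh_reject(example, 0.03, m0_hat=2)  # estimate: half the nulls are true
```

With `m0_hat` equal to the number of tests the procedure coincides with ordinary (BH); smaller estimates raise the working level and can only add rejections.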

Although in this paper we are only concerned with independent p-values,
we note that there are some results resp. modifications available
for dealing with dependent p-values. Benjamini and Yekutieli show that
under the most general dependency structure the (BH) procedure still
controls the FDR at level $\alpha \cdot (1 + 1/2 + 1/3 + \cdots + 1/K)$, cf. [5, Theorem
1.3]. In [ ] they also investigate simulation-based approaches which
allow more specific modelling of dependency structures.

3.4. Comments. The MTPs introduced in this section provide flexible
and versatile tools for the task of PD validation. Conceptually they
allow a separation of the validation process into two steps.

(1) In the first step, tests for the individual calibration hypotheses
are carried out. This is a step which has to be performed in
any case due to regulatory requirements. The results of these
tests can be presented in terms of p-values. As noted in [29],
using p-values instead of the original test statistics may be more
appropriate when tests are discrete, since p-values are on the
same scale, whereas test statistics, which are often based on
counts, are generally not.

The only general requirement at this stage is that the employed
tests should be as powerful as possible, subject to the
control of the type I error. But apart from this, the MTP
approach allows liberty in the choice of which specific test is
used. For instance, it would also be possible to accommodate
traffic light approaches, as long as the results can be expressed
in terms of p-values.

(2) In the second step, the marginal p-values are combined by some
appropriate MTP like (Bonf), (d-Bonf) or (BH), yielding multiplicity-adjusted
p-values resp. rejected calibration hypotheses.

An additional advantage of using approaches based on p-values is that
this provides a flexible and natural way of dealing with heterogeneous
tests over different classes. In particular, it answers the question of
Huschens mentioned in the introduction: If data is sparse in some
classes and rich in some others it is possible to use e.g. asymptotic
tests for the data-rich classes while using exact tests for the others.

4. Simulation study 

In this section we introduce a simple prototype credit portfolio and 
two types of misclassification matrices that will describe incorrect PD 
calibration. With these components we will assess the performance of 
MTPs for identifying conspicuous rating classes and for rejecting the 
global hypothesis. 

For a compact description of the results, we group the MTPs from 
section 3 in the following way: 

• group A consists of the Bonferroni-based procedures (Bonf),
(Hol) and (Hom),

• group B consists of the FDR procedure (BH) and its adaptive
modification (a-BH), and

• group C consists of the MinP methods (d-Bonf) and (sd-d-Bonf)
(and in some instances also (d-Ind)).

All calculations in this paper were done using the statistical software
package R, see [23]. For all grade-wise tests the exact binomial test
implemented in the R function binom.test was used. The (Bonf), (Hol),
(Hom) and (BH) adjusted p-values were calculated using the R function
p.adjust. For (a-BH) the R package fdrtool was used for estimating
$m_0$. The code for (d-Bonf), (d-Ind) and (sd-d-Bonf) was implemented
by the author.
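For readers without R at hand, the (Bonf), (Hol) and (BH) adjustments
can be sketched in plain Python. This is a minimal illustration of the
standard textbook definitions and not the author's code; the function
names are hypothetical, and the input is the row of marginal p-values
printed in table 1 (rounded to four digits, so the results agree with
the table only up to rounding).

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm(pvals):
    """Holm step-down adjustment with a running maximum over ascending p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):       # rank = 0 for the smallest p-value
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

def benjamini_hochberg(pvals):
    """BH step-up adjustment with a running minimum over descending p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted, running_min = [0.0] * m, 1.0
    for k, i in enumerate(order):          # k = 0 for the largest p-value
        rank = m - k                       # ascending rank of pvals[i]
        running_min = min(running_min, m * pvals[i] / rank)
        adjusted[i] = running_min
    return adjusted

# marginal p-values from table 1 (classes 1, ..., 9, 11)
pv = [1.0000, 0.0137, 1.0000, 0.0420, 1.0000,
      0.1061, 0.1448, 0.0092, 0.2583, 0.1108]
for name, adjust in [("Bonf", bonferroni), ("Hol", holm), ("BH", benjamini_hochberg)]:
    print(name, [round(a, 4) for a in adjust(pv)])
```

A hypothesis is rejected at level 5% whenever its adjusted p-value is
at most 0.05; none of the three procedures rejects for this sample.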

4.1. Example portfolio and misclassification matrices. 

4.1.1. Example portfolio. We consider a rating system consisting of
$K = 14$ classes. As described in section 2, we assume that borrowers
are assigned to one of these classes by some given model or mechanism.
The credit portfolio consists of given true PDs $p_1, \dots, p_{14}$
and some probability distribution $q_1, \dots, q_{14}$ of borrowers to
rating classes (cf. table 3). These components make up an artificial
example but nevertheless contain some typical features. The number
$K = 14$ of classes may seem large, but even $K = 25$ classes are not
uncommon. The S&P rating system considered in the empirical study in
section 5, for instance, uses 17 rating classes. Another typical
feature is the sparsity of data: most of the default probabilities are
rather small and the numbers of borrowers may also be small in several
classes. Specifically, it is to be expected that there may be numerous
classes where the distribution of test statistics is quite discrete
and asymptotic methods may not be valid.

In the simulation experiments we will consider 10 portfolio sizes
$N_{PF} = 100, \dots, 1000$. The portfolios are class-wise increasing
in the sense that if $n_j(N_{PF})$ denotes the number of borrowers in
the portfolio with true PD $p_j$, we have
$n_j(100) \le n_j(200) \le \dots \le n_j(1000)$ for $j = 1, \dots, 14$,
and the relative frequencies of borrowers within the rating classes
are roughly equal to $q_1, \dots, q_{14}$. For each $N_{PF}$ we only
draw one realisation of the portfolio, i.e. we ignore the sampling
variability that arises from drawing finite samples of borrowers from
the distribution.

4.1.2. Two types of misclassification schemes. We introduce two simple
types of models for the misclassification matrix from section 2.
Upgrade-downgrade model. The upgrade-downgrade model
$N^{ud} = N^{ud}(s)$ is parametrized by a shift parameter
$s \in \{-K, -K+1, \dots, K-1, K\}$. Each borrower is upgraded by $s$,
i.e. if $g$ is the true rating grade of a borrower, he/she will be
classified to group

\[
\hat{g} =
\begin{cases}
1     & \text{if } g - s \le 0, \\
g - s & \text{if } g - s \in \{1, \dots, K\}, \\
K     & \text{if } g - s \ge K + 1,
\end{cases}
\]

resp. for the estimated PDs it holds

\[
\hat{p}_i =
\begin{cases}
p_1     & \text{if } i - s \le 0, \\
p_{i-s} & \text{if } i - s \in \{1, \dots, K\}, \\
p_K     & \text{if } i - s \ge K + 1.
\end{cases}
\]

This model represents a systematic, monotone error in the rating
system resp. PD estimation (cf. [21]). If $s > 0$ each borrower is
rated too optimistically (low rating classes corresponding to low
default probabilities) resp. for $s < 0$ too pessimistically. For
$s = 0$ the ideal classification resp. estimation is obtained.

Example. The matrix below gives an example for $N_{PF} = 300$ and
$s = -3$.

(4)    $N^{ud}(-3) =$ [matrix about here]

In this case the 5, 14 and 22 borrowers from rating classes 2, 3 and 4
are upgraded to rating class 1, 46 borrowers from grade 5 are assigned
to class 2 etc.

Even though $s$ is a metric variable, it may be more appropriate to
interpret its influence on the misclassification matrix in an ordinal
way, i.e. a shift of 2 is more pessimistic than a shift of 1 but not
twice as pessimistic.
Dispersion model. The other scenario we consider is the model
$N^{d} = N^{d}(h)$ with dispersion parameter $h > 0$, where

\[
n_{ij} = n_{ij}(h) = [\, n_i \cdot w_{ij}(h) \,],
\]

where $[\cdot]$ denotes rounding and the weights are defined by

\[
w_{ij}(h) = \frac{W_h(|i - j|)}{\sum_{l=1}^{K} W_h(|i - l|)}
\]

and $W_h$ is defined by the function

\[
W_h(x) = \varphi(x / h),
\]

where $\varphi$ denotes the density function of $N(0,1)$. This means
that the weight matrix $(w_{ij}(h))$ converges to the identity matrix
for $h \to 0$, the respective $N^{d}$ representing the ideal
classifier, and in the worst case, for $h \to \infty$, the number of
true borrowers per rating class is dispersed roughly uniformly over
all rating classes. Due to rounding differences, the total number of
borrowers may change (moderately) for different values of $h$. This
model represents a random error in the sense that as $h$ increases,
the classification becomes increasingly imprecise. As in the case of
the upgrade-downgrade model, this model is only intended as a simple
way of obtaining a certain kind of misclassification.
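The dispersion model can be made concrete with a short Python sketch.
It assumes that the weights are normalised per true class (row-wise),
as in the formula for the weights, and uses Python's built-in round,
which rounds halves to the nearest even integer; the paper does not
specify the rounding rule. The class sizes in the usage example are
hypothetical.

```python
from math import exp, pi, sqrt

def normal_density(x):
    """Density of the standard normal distribution."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def dispersion_weights(K, h):
    """Weights w_ij(h) = W_h(|i-j|) / sum_l W_h(|i-l|) with W_h(x) = phi(x/h)."""
    rows = []
    for i in range(K):
        raw = [normal_density(abs(i - j) / h) for j in range(K)]
        total = sum(raw)
        rows.append([r / total for r in raw])
    return rows

def dispersion_matrix(true_counts, h):
    """Rounded entries n_i * w_ij(h): borrowers of true class i assigned to class j."""
    weights = dispersion_weights(len(true_counts), h)
    return [[round(n_i * w_ij) for w_ij in row]
            for n_i, row in zip(true_counts, weights)]

counts = [10, 40, 30, 20]        # hypothetical true class sizes n_i
for h in (0.01, 1.0, 100.0):
    print(h, dispersion_matrix(counts, h))
```

For very small h the matrix is diagonal (ideal classification), while
for very large h each row is spread almost evenly over all classes.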

4.2. Identification of conspicuous rating classes. We now apply
the MTPs introduced in section 3 to the problem of identifying
conspicuous rating classes, i.e. rejecting single hypotheses $H_0^j$.
In most cases, groups A, B and C show quite distinct behavior.

4.2.1. Numerical example for a single sample of defaults. We begin by
describing the way the discretized MinP methods work for a concrete
numerical example as given by table 1. Suppose we have $N_{PF} = 300$
borrowers and the misclassification is given by the upgrade-downgrade
matrix with shift $s = -3$ from (4) in the example above. The entries
of this matrix together with the mapping of rating classes to default
probabilities yield the first three rows of table 1. The 300 borrowers
have been classified into 10 out of 14 possible classes. Suppose the
observed validation sample is given by row 4 of this table, resulting
in p-values for the exact (two-sided) binomial test in row 5. The rest
of the table consists of the adjusted p-values produced by the various
multiple testing procedures described in section 3. Within group A,
(Hom) is more powerful than (Hol), which in turn is more powerful than
(Bonf); this is a known general result (cf. [ ]). However, even (Hom)
does not reject any of the hypotheses. For both methods in B identical
results hold, i.e. $m_0$ was estimated as $K$. The adjusted p-values
are substantially lower than for group A but still fail to achieve
significance. Within group C no relevant difference between (d-Bonf)
and (d-Ind) is discernible, but both procedures are outperformed by
the step-down version (sd-d-Bonf). These procedures are able to
identify one resp. two invalid PD estimates.

class j            1        2        3        4        5        6        7        8        9       11
n_j               43       46       39       39       43       32       26       14       16        2
pd_j          0.0001   0.0003   0.0006   0.0011   0.0020   0.0035   0.0060   0.0105   0.0185   0.0570
defaults           0        1        0        1        0        1        1        2        1        1
p-values      1.0000   0.0137   1.0000   0.0420   1.0000   0.1061   0.1448   0.0092   0.2583   0.1108
(Bonf)        1.0000   0.1371   1.0000   0.4202   1.0000   1.0000   1.0000   0.0923   1.0000   1.0000
(Hol)         1.0000   0.1234   1.0000   0.3361   1.0000   0.7429   0.7429   0.0923   1.0000   0.7429
(Hom)         1.0000   0.1234   1.0000   0.2941   1.0000   0.5307   0.6457   0.0830   1.0000   0.5538
(BH)          1.0000   0.0685   1.0000   0.1401   1.0000   0.2215   0.2414   0.0685   0.3690   0.2215
(a-BH)        1.0000   0.0685   1.0000   0.1401   1.0000   0.2215   0.2414   0.0685   0.3690   0.2215
(d-Ind)       1.0000   0.0551   1.0000   0.1428   1.0000   0.2906   0.5237   0.0322*  0.6341   0.3671
(d-Bonf)      1.0000   0.0564   1.0000   0.1512   1.0000   0.3316   0.7015   0.0327*  0.9251   0.4391
(sd-d-Bonf)   1.0000   0.0472*  1.0000   0.1291   1.0000   0.2666   0.2915   0.0327*  0.3703   0.2680

Table 1. Adjusted p-values for a single realization of defaults with
$N_{PF} = 300$ and upgrade-downgrade alternative with $s = -3$
(significant findings at the 5% level are marked with *)

The workings of (d-Bonf) and (sd-d-Bonf) are illustrated in figure 2
for the two smallest p-values $pv_8$ and $pv_2$ (represented by ticks
on the x-axis).

[Fig. 2 about here]

For (d-Bonf) the distribution function $F_{\{1,\dots,9,11\}}$,
represented by the solid line, is determined by the method described
in section 3.2.3. Evidently, $F_{\{1,\dots,9,11\}}(pv_8) < 0.05$ but
$F_{\{1,\dots,9,11\}}(pv_2) > 0.05$, so this procedure only rejects
$H_0^8$. The procedure (sd-d-Bonf) starts with
$F_{\{1,\dots,9,11\}}(pv_8)$ as well, thereby rejecting $H_0^8$. In
the second step, $F_{\{1,\dots,7,9,11\}}$, represented by the dashed
line, is determined, which now yields a barely significant result for
$pv_2$. In the successive steps, the functions
$F_{\{1,3,\dots,7,9,11\}}, F_{\{1,3,5,\dots,7,9,11\}}, \dots$ are
determined and evaluated at the next larger p-values; the resulting
adjusted p-values defined by (2) are listed in the last row of table 1.
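The following Python sketch illustrates these steps for the sample in
table 1. It rests on assumptions that should be checked against
section 3.2.3: (d-Bonf) is taken to be the sum of the marginal p-value
distribution functions capped at 1, (d-Ind) the corresponding product
form under independence, and (sd-d-Bonf) the step-down version with a
running maximum. The null PD of class 1 is taken as 0.00015 from table
3 (table 1 shows it rounded). All function names are hypothetical.

```python
from math import comb

def pvalue_table(n, p):
    """(probability, two-sided exact p-value) for every outcome x = 0..n of Bin(n, p)."""
    pmf = [comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(n + 1)]
    return [(q, min(1.0, sum(r for r in pmf if r <= q * (1.0 + 1e-7)))) for q in pmf]

def marginal_cdf(table, t):
    """F_j(t) = P(PV_j <= t) under the class-wise null hypothesis."""
    return sum(q for q, pv in table if pv <= t * (1.0 + 1e-9))

def d_bonf(tables, classes, t):
    """Assumed (d-Bonf) form: capped sum of the marginal distribution functions."""
    return min(1.0, sum(marginal_cdf(tables[j], t) for j in classes))

def d_ind(tables, classes, t):
    """Assumed (d-Ind) form: distribution of the minimum of independent p-values."""
    prod = 1.0
    for j in classes:
        prod *= 1.0 - marginal_cdf(tables[j], t)
    return 1.0 - prod

def sd_d_bonf(tables, pvals):
    """Step-down variant: each class leaves the index set after it has been processed."""
    remaining, adjusted, running_max = set(pvals), {}, 0.0
    for j in sorted(pvals, key=pvals.get):
        running_max = max(running_max, d_bonf(tables, remaining, pvals[j]))
        adjusted[j] = running_max
        remaining.discard(j)
    return adjusted

# class: (n_j, null PD, observed defaults), taken from tables 1 and 3
data = {1: (43, 0.00015, 0), 2: (46, 0.0003, 1), 3: (39, 0.0006, 0),
        4: (39, 0.0011, 1), 5: (43, 0.0020, 0), 6: (32, 0.0035, 1),
        7: (26, 0.0060, 1), 8: (14, 0.0105, 2), 9: (16, 0.0185, 1),
        11: (2, 0.0570, 1)}
tables = {j: pvalue_table(n, p) for j, (n, p, _) in data.items()}
pvals = {j: tables[j][d][1] for j, (_, _, d) in data.items()}
step_down = sd_d_bonf(tables, pvals)
for j in (8, 2):
    print(j, round(d_bonf(tables, list(data), pvals[j]), 4), round(step_down[j], 4))
```

Under these assumptions the single-step adjusted p-value for class 2
stays above 0.05, whereas removing class 8 from the index set in the
second step brings it just below 0.05, which is the effect described
in the text.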

4.2.2. Simulation results for a single portfolio and misclassification
matrix. For the observed defaults in table 1 the highest number of
invalid rating classes was identified by (sd-d-Bonf), the second most
by (d-Bonf) and fewer by all other procedures. It would be interesting
to see whether this picture is due to the specific observation or is
representative of the general situation. In order to investigate this,
10000 observations with the true default probabilities were simulated
and for each simulation the testing procedures were evaluated as in
table 1. A summary of these results is given in figure 3.

[Fig. 3 about here] 

For the ability to identify validation violations, the findings from
table 1 still basically hold true, i.e.:

• the procedures from group A possess the lowest power,

• group B outperforms group A,

• group C outperforms group B except for classes 2 and 3, where
(BH) and (a-BH) are better than (d-Bonf) and (d-Ind),

• the (sd-d-Bonf) procedure is superior to all other procedures.

Note also that the main classes identified as questionable are classes 
with high ratings, i.e. relatively high default probabilities. For the 
low rating classes the relatively large sample sizes are not able to com- 
pensate for the loss of power resulting from the extremely low default 
probabilities. 

Another measure for comparing the relative power of the procedures 
is given by the average number of rejections. For the simulations un- 
derlying figure 3 the results are listed in table 2. Again, the result is 
consistent with previous analyses: group A constitutes the least pow- 
erful, group C the most powerful methods. Within the latter group 
(d-Bonf) and (d-Ind) perform similarly, (sd-d-Bonf) performs best, al- 
beit with only a slight advantage. An intermediate position is taken 
by group B. 



(Bonf)   (Hol)    (Hom)    (BH)     (a-BH)   (d-Bonf)  (d-Ind)  (sd-d-Bonf)
0.5625   0.5715   0.5836   0.7589   0.7614   0.9962    0.9962   1.0309

Table 2. Average number of rejections for $N_{PF} = 300$ and
upgrade-downgrade alternative with $s = -3$



4.2.3. Simulation results for average number of rejections.
Upgrade-downgrade misclassification. In table 2 the average number of
rejected hypotheses was given for a specific portfolio size and a
specific shift value in the upgrade-downgrade model. Figure 4
illustrates corresponding simulation results for varying portfolio
sizes $100, \dots, 1000$ and shifts $s = -5, \dots, 5$.

[Fig. 4 about here]

Again, the results within the groups are in line with previous
analyses. For portfolio sizes up to 400, group C outperforms group B,
while for larger portfolios the situation is more ambivalent. For
negative shifts, i.e. pessimistic ratings, the best procedure in C
appears to be somewhat superior to the best procedure in B and vice
versa for positive shifts.
Dispersed misclassification. For this type of alternative the results
are similar to those of the upgrade-downgrade alternative. As
illustrated in figure 5, group A is uniformly outperformed by groups B
and C. For portfolio sizes up to 600, group C outperforms group B,
while for bigger portfolios the procedures in B are superior to C,
especially for large values of the dispersion parameter.

[Fig. 5 about here] 

4.3. Tests for the global calibration hypothesis. Next we investigate
the power of some of the methods from section 3 for the problem of
testing the global hypothesis $H_0 = H_0^1 \cap \dots \cap H_0^K$,
i.e. the probability of rejecting at least one hypothesis when at
least one of the calibration hypotheses is false. Since we are
interested only in the probability of rejecting at least one
hypothesis, it suffices to consider (Bonf) and (Hom) from group A and
(d-Bonf) from group C as well as (BH) and (a-BH) from group B. We
study the power of these procedures for the upgrade-downgrade and the
dispersion setting introduced in section 4.1. Additionally, we compare
these results to the power of (HL) for detecting violation of $H_0$.
For each combination of $s$ and $N_{PF}$ the corresponding
misclassification matrix was generated and $N_{sim} = 10000$
simulations of default numbers $O_j$ for classes with $n_j > 0$ were
carried out. This means that the standard error of the simulated
rejection probabilities is bounded by 0.5%.
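The bound on the standard error follows from the variance of a
relative frequency. As a short check, with $\pi$ denoting the true
rejection probability and $\hat{\pi}$ its simulated frequency (both
symbols are introduced here only for illustration):

```latex
\[
  \mathrm{SE}(\hat{\pi})
  = \sqrt{\frac{\pi (1 - \pi)}{N_{sim}}}
  \le \frac{1}{2 \sqrt{N_{sim}}}
  = \frac{1}{2 \sqrt{10000}}
  = 0.005 ,
\]
```

since $\pi (1 - \pi) \le 1/4$ for all $\pi \in [0, 1]$.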
Results for upgrade-downgrade misclassification. Figure 6 depicts the
simulated rejection probabilities in the case of alternatives of the
upgrade-downgrade type for shifts $s = -5, \dots, 5$ and for numbers of
borrowers $N_{PF} = 100, \dots, 1000$.

[Fig. 6 about here] 

It shows that in most constellations the procedures from groups A
and B perform comparably. Again, (d-Bonf) seems to be the most
powerful of the multiple testing procedures investigated here. It
always outperforms the procedures from A and B. For positive values of
$s$, all MTPs seem to be superior to (HL); for negative values of $s$
it is vice versa, with (d-Bonf) still performing relatively similarly
to (HL).
Results for dispersed misclassification. As in the case of
upgrade-downgrade misclassification there seems to be little
difference in the power of the procedures from groups A and B, cf.
figure 7.

[Fig. 7 about here]

Again, the (d-Bonf) method seems to outperform both other groups.
For small sample sizes up to $N_{PF} = 300$, the (HL) test outperforms
all MTPs. For greater sample sizes, all MTPs seem to be superior to
(HL) for large values of the dispersion parameter.

5. Empirical study 

In this section we apply MTPs to empirical default data, using cor- 
porate data and PD estimates presented in Blochinger and Leippold 
(cf. [7] for more details). Table 4 presents two PD estimates for S&P 
rating classes based on the duration and cohort approach. The esti- 
mates were obtained using training data from 1981 to 2002 and we 
follow Blochinger and Leippold and perform backtesting for the years 
2003 to 2008. While they focus on the overall calibration resp. cali- 
bration concerning level and shape, our goal is again to identify which 
of the 17 rating classes are miscalibrated. 

Column r in tables 5 and 6 lists which of the MTPs detected
miscalibrated PDs. As in the simulation studies of section 4, the
procedures from group A are inferior to those from groups B and C. For
both types of PD estimates, group B is able to identify some
additional conspicuous PD estimates as compared to group C. Note also
that there are several classes where (a-BH) performs strictly better
than (BH). Hence group B outperforms group C for these validation
samples. The MTP analysis might seem to suggest that, except for the
year 2008, miscalibration is mainly a feature of the rating classes
with high PD. This conclusion may again be questionable in view of the
low power for the high rating classes (see also the analysis in 4.2.2).

[Table 5 about here] 

[Table 6 about here] 

If the same procedures are used to test the overall calibration
hypothesis, then the findings are for the major part similar to the
results described for the independence case in [7]: for the years
2004-2008, (HL), (BL) and all MTPs produce significant findings at the
5% level. For 2003 neither the MTPs nor the (exact) HL test detect any
miscalibration. Only the BL test is able to reject the calibration
hypothesis for the PD estimates derived from the cluster approach.

The BL test has the advantage that, if the assumed two-component
model holds true, it may be possible to identify the component(s) that
lead to rejection of the calibration hypothesis. Note however that
this need not always be the case, as the data for the cluster PD
estimates in 2003 illustrate. Here both the level and shape components
are insignificant (both p-values equal 0.2061) while the global test
is significant (p-value 0.0243). Approaches based on MTPs, on the
other hand, are more nonparametric in nature, i.e. no parametric model
in the sense of the two-component model in [7] is assumed. Since MTPs
provide a per-class assessment, they may give more detailed
information than the BL test. The BL test, on the other hand, has the
appeal of providing results that can be interpreted in terms of level
and shape within the assumed model.



6. Extension to dependent defaults

In this section we sketch how the MinP methods used in this paper
for independent defaults can be extended to dependent defaults. A
more detailed description and analysis will be the subject of future
work.

In the one-factor model, which is also used in the IRB approach of
Basel II, the creditworthiness of each borrower $i$ is modeled as

\[
A_i = \sqrt{\rho}\, Z + \sqrt{1 - \rho}\, U_i
\]

with $0 < \rho < 1$, where $Z, U_1, \dots, U_N \sim N(0,1)$ iid, cf.
e.g. [ ]. The risk factor $Z$ denotes the systematic risk component
which is common to all borrowers and $U_i$ is the idiosyncratic risk
that is specific to borrower $i$. The 'asset correlation' $\rho$
describes the dependency of individual defaults on the systematic risk
component. Accordingly, default indicators $X_{ij}$ for borrower $i$
in rating class $j$ can be defined by

\[
X_{ij} =
\begin{cases}
1 & \text{if } A_i \le \Phi^{-1}(p_j), \\
0 & \text{else,}
\end{cases}
\]

where $\Phi^{-1}$ is the quantile function of $N(0,1)$. Note that
$X_{ij} \sim \mathrm{Bin}(1, p_j)$, but the indicators are no longer
independent. Assume that the test statistics $T_j$ resp. p-values
$PV_j$ per class are measurable functions of
$S_j = (X_{1j}, \dots, X_{n_j j})$. Then the MinP approach can be
implemented as follows:

(1) Obtain the p-value functions
$PV_j = PV(s_j) = PV_j(x_{1j}, \dots, x_{n_j j})$.

(2) Obtain the distribution functions of the $PV_j$'s.

(3) Obtain the distribution function $F_{\{1,\dots,K\}}$ of
$\min(PV_1, \dots, PV_K)$.

(4) For a given sample of defaults $(s_1, \dots, s_K)$ calculate the
marginal p-values

\[
pv_1 = PV(s_1) = PV_1(x_{11}, \dots, x_{n_1 1}), \quad \dots, \quad
pv_K = PV(s_K) = PV_K(x_{1K}, \dots, x_{n_K K})
\]

resp. the adjusted p-values

\[
\widetilde{pv}_1 = F_{\{1,\dots,K\}}(pv_1), \quad \dots, \quad
\widetilde{pv}_K = F_{\{1,\dots,K\}}(pv_K)
\]

and continue in the spirit of section 3.2.

In steps 1 through 3 it may not be feasible to obtain exact quantities.
In these cases, simulation can be used to obtain sufficiently accurate
estimates.

Since by assumption $PV_j$ is a function of
$S_j = (X_{1j}, \dots, X_{n_j j})$ and the distribution of this vector
depends only on $p_j$ (and the fixed asset correlation), proposition 1
shows that (SPC) is fulfilled and therefore the procedure sketched
above also maintains control of the FWER.
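Where exact quantities are out of reach, the simulation mentioned
above could start from a sketch like the following. It assumes the
standard one-factor form with a single systematic factor shared by all
borrowers and classes; the function name simulate_defaults, the class
sizes, the PDs and the asset correlation in the usage example are
hypothetical choices for illustration only.

```python
import random
from statistics import NormalDist

def simulate_defaults(class_sizes, pds, rho, n_scenarios, seed=0):
    """Simulate default counts per rating class in a one-factor model.

    Returns one list of class-wise default counts per scenario.
    """
    rng = random.Random(seed)
    thresholds = [NormalDist().inv_cdf(p) for p in pds]   # quantiles of N(0,1) at p_j
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    scenarios = []
    for _ in range(n_scenarios):
        z = rng.gauss(0.0, 1.0)          # systematic factor, shared by all borrowers
        counts = []
        for n_j, c_j in zip(class_sizes, thresholds):
            # a borrower defaults if the latent variable falls below the threshold
            counts.append(sum(a * z + b * rng.gauss(0.0, 1.0) <= c_j
                              for _ in range(n_j)))
        scenarios.append(counts)
    return scenarios

sims = simulate_defaults([50, 50], [0.1, 0.2], rho=0.2, n_scenarios=2000, seed=1)
rate = sum(s[0] for s in sims) / (50 * len(sims))
print("simulated default rate in the first class:", round(rate, 3))
```

Feeding each simulated scenario through the class-wise tests yields
simulated p-values, from which the distribution function of their
minimum in step 3 can be estimated empirically.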

7. Discussion 

In this paper we have applied MTPs to testing the calibration of 
PD estimates in credit rating systems with a view towards identify- 
ing miscalibrated PD estimates. We have considered procedures that 
control FWER and FDR and have investigated their performance in a 
simulation setting and for empirical data. 

For FWER, the results show that the power of 'standard' procedures
can be substantially improved by MinP methods, which take the
discreteness of data into account. These methods also perform well as
tests of the overall calibration hypothesis. In addition, we have used
the more explorative FDR methodology for identifying conspicuous PD
estimates. In the simulation study, the power of these methods was
roughly comparable to the MinP methods, while for the empirical data
they outperformed the MinP methods. This may be due to the higher
number of rating grades resp. tests performed. Note also that no
attempt was made to adapt the FDR procedures to the discreteness of
the data. If this were done in an appropriate way, their power might
be considerably enhanced. In this sense the presentation given in this
paper is somewhat biased against the FDR approach. Altogether, we
conclude that in the framework of independent defaults considered
here, MTPs can serve as helpful tools for identifying miscalibrated
resp. conspicuous PD estimates.

We have also seen that MTPs frequently perform well as tests for 
the global calibration hypothesis. The question whether to use these 
methods or a global test has been discussed in the statistical literature. 
Westfall and Wolfinger sum up the situation as follows (see [29]): 

'The global test will have higher power in situations where 
there is a mild departure from the null for many tests, 
whereas our [MinP ] methods have high power, when 
there are marked departures from the null at only a few 
sites [tests]. Regardless of power comparisons, a major 
problem with global tests is their failure to isolate specific 
significances. ' 

While we have concentrated on the independence case for the sake of
illustrating the main ideas as simply as possible, credit default
events are usually not independent and procedures that aim to identify
miscalibrated PD estimates should take this into account. One possible
approach would consist of using the Bonferroni variant of the MinP
method resp. the Benjamini-Yekutieli variant of the (BH) approach as
'worst case' types of dependency. However, this may be overly
conservative. It seems more promising to extend the MinP method as
described in section 6 in order to account for specific forms of
dependencies. This will be the subject of future work.

Acknowledgements. The author would like to thank Marcus R.W.
Martin for helpful comments.

Tables

class i         1       2       3       4       5       6       7
100 · p_i   0.015    0.03    0.06    0.11     0.2    0.35     0.6
q_i         0.009   0.014   0.053   0.070   0.133   0.133   0.164

class i         8       9      10      11      12      13      14
100 · p_i    1.05    1.85    3.25     5.7    10.0    17.5    33.8
q_i         0.149   0.123   0.077   0.046   0.020   0.008   0.002

Table 3. True PDs and probability distribution of borrowers
to (true) rating classes in the simulation study

S&P rating   PD (Duration)   PD (Cluster)
AAA                   0.02           1.00
AA+                   0.05           1.00
AA                    0.43           1.00
AA-                   0.44           3.84
A+                    0.46           5.20
A                     0.84           6.49
A-                    1.00           6.49
BBB+                  4.67          31.37
BBB                  11.65          36.23
BBB-                 14.53          40.12
BB+                  33.01          55.01
BB                   45.64         116.33
BB-                  88.51         207.18
B+                  175.41         349.80
B                   758.33         982.01
B-                 1343.30        1430.16
CCC                4249.04        2853.54

Table 4. Estimated probabilities of default (PD in bps,
i.e. 1/100%) for the duration and cluster approaches for
the S&P data from [7]

Figures

Figure 1. Distribution functions of Fj^"*^ and
F^i"'\iy in the continuous and the discrete case

Figure 2. Distribution functions for MinP for the first
and second step in (sd-d-Bonf)

Figure 3. Simulated probabilities of rejecting null hypotheses
corresponding to the rating classes for $N = 300$ and
upgrade-downgrade alternative with $s = -3$